Appendix B — Assignment B
Instructions
You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.
Do not write your name on the assignment.
Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.
Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.
The assignment is worth 100 points, and is due on Thursday, 26th January 2023 at 11:59 pm.
Five points are for properly formatting the assignment. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (1 pt). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file.
- No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission (1 pt)
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
- Final answers of each question are written in Markdown cells (1 pt).
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
B.1 Multiple linear regression
A study was conducted on 97 men with prostate cancer who were due to receive a radical prostatectomy. The dataset prostate.csv contains data on 9 measurements made on these 97 men. The description of variables can be found here:
B.1.1 Training MLR
Fit a linear regression model with lpsa as the response and all the other variables as predictors. Write down the equation to predict lpsa based on the other eight variables.
(2+2 points)
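As a rough illustration of the workflow (not a solution), the sketch below fits a multiple linear regression with the statsmodels formula API and assembles the fitted equation from the estimated coefficients. It uses synthetic data; the column names x1, x2, x3 and y are hypothetical stand-ins and are not taken from prostate.csv.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT prostate.csv): y depends on x1 and x2 only
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=n)

# Build the formula "y ~ x1 + x2 + x3" from the column names
predictors = [c for c in df.columns if c != "y"]
formula = "y ~ " + " + ".join(predictors)
model = smf.ols(formula, data=df).fit()

# Assemble the fitted equation from the estimated coefficients
terms = [f"{model.params['Intercept']:.3f}"]
terms += [f"({model.params[p]:.3f})*{p}" for p in predictors]
equation = "y_hat = " + " + ".join(terms)
print(equation)

# p-value of the overall F-test (all slopes jointly zero)
print(model.f_pvalue)
```

Calling model.summary() prints the same coefficients together with the F-statistic, which is one way to read off the quantities asked about in the next few questions.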
B.1.2 Model significance
Is the overall regression significant at the 5% level? Justify your answer.
(2 points)
B.1.3 Coefficient interpretation
Interpret the coefficient of svi.
(2 points)
B.1.4 Variable significance
Report the \(p\)-values for gleason and age. What do you conclude about the significance of these variables?
(2+2 points)
B.1.5 Variable significance from confidence interval
What is the 95% confidence interval for the coefficient of age? Can you conclude anything about its significance based on the confidence interval?
(2+2 points)
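The link between a coefficient's \(p\)-value and its confidence interval can be checked numerically. The following is a minimal sketch on synthetic data (the names x1, x2 and y are hypothetical); it assumes a model fitted with the statsmodels formula API, where pvalues and conf_int() are available on the fitted results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: x1 has a real effect on y, x2 has none
rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 0.5 + 1.5 * df["x1"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2", data=df).fit()

pvals = model.pvalues            # two-sided t-test p-value per coefficient
ci = model.conf_int(alpha=0.05)  # DataFrame; column 0 = lower, column 1 = upper

for name in ["x1", "x2"]:
    lower, upper = ci.loc[name, 0], ci.loc[name, 1]
    # A 95% interval that excludes zero corresponds to p < 0.05
    excludes_zero = (lower > 0) or (upper < 0)
    print(name, round(pvals[name], 4), (round(lower, 3), round(upper, 3)), excludes_zero)
```

Both quantities come from the same \(t\)-distribution, so the 95% interval excludes zero exactly when the two-sided \(p\)-value is below 0.05.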
B.1.6 \(p\)-value
Fit a simple linear regression on lpsa against gleason. What is the \(p\)-value for gleason?
(1+1 points)
B.1.7 Predictor significance in presence / absence of other predictors
Is the predictor gleason statistically significant in the model developed in the previous question (B.1.6)?
Was gleason statistically significant in the model developed in the first question (B.1.1) with multiple predictors?
Did the statistical significance of gleason change in the absence of other predictors? Why or why not?
(1+1+4 points)
B.1.8 Prediction
Predict lpsa of a 65-year-old man with lcavol = 1.35, lweight = 3.65, lbph = 0.1, svi = 0.22, lcp = -0.18, gleason = 6.75, and pgg45 = 25, and find the 95% prediction interval.
(2 points)
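One way to obtain a prediction interval in statsmodels is get_prediction() followed by summary_frame(). The sketch below uses synthetic data with hypothetical columns x1, x2 and y; it is meant only to show which columns of the returned frame hold the interval for a new observation as opposed to the interval for the mean response.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with two hypothetical predictors
rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, size=n),
    "x2": rng.uniform(0, 5, size=n),
})
df["y"] = 2.0 + 0.8 * df["x1"] - 1.2 * df["x2"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2", data=df).fit()

# A single new observation, given as a DataFrame with the predictor columns
new_obs = pd.DataFrame({"x1": [4.0], "x2": [2.5]})
pred = model.get_prediction(new_obs).summary_frame(alpha=0.05)

point = pred["mean"].iloc[0]
# Confidence interval for the mean response at new_obs
ci_mean = (pred["mean_ci_lower"].iloc[0], pred["mean_ci_upper"].iloc[0])
# Prediction interval for an individual new observation (always wider)
pi_obs = (pred["obs_ci_lower"].iloc[0], pred["obs_ci_upper"].iloc[0])
print(point, ci_mean, pi_obs)
```

The obs_ci_* columns account for the noise of an individual observation in addition to the uncertainty in the fitted mean, which is why they bracket the mean_ci_* columns.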
B.1.9 Variable selection
Find the largest subset of predictors in the model developed in the first question (B.1.1), such that their coefficients are zero, i.e., none of the predictors in the subset are statistically significant.
Does the model \(R\)-squared change a lot if you remove the set of predictors identified above from the model in the first question (B.1.1)?
Hint: You may use the f_test() method to test hypotheses.
(4+1 points)
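The f_test() method mentioned in the hint accepts a string of linear restrictions on the coefficients. A minimal sketch on synthetic data follows; the predictor names x1 to x4 are hypothetical, and the subset tested here is chosen by construction of the synthetic data rather than by any selection procedure.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: only x1 truly affects y
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({f"x{i}": rng.normal(size=n) for i in range(1, 5)})
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=n)

full = smf.ols("y ~ x1 + x2 + x3 + x4", data=df).fit()

# Joint null hypothesis: the coefficients of x2, x3 and x4 are all zero
joint = full.f_test("x2 = 0, x3 = 0, x4 = 0")
p_joint = float(np.squeeze(joint.pvalue))

# Reduced model without the tested predictors, to compare R-squared
reduced = smf.ols("y ~ x1", data=df).fit()
r2_drop = full.rsquared - reduced.rsquared

print(p_joint, full.rsquared, reduced.rsquared, r2_drop)
```

A large \(p\)-value from the joint test means the data do not contradict the hypothesis that all listed coefficients are zero, which is a different statement from each predictor being individually insignificant.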
B.2 Using MLR coefficients and variable transformation
The dataset infmort.csv gives the infant mortality of different countries in the world. The column mortality contains the infant mortality in deaths per 1000 births.
B.2.1 Data visualisation
Make the following plots:
- a boxplot of log(mortality) against region (note that a plot of log(mortality) against region better distinguishes the mortality among regions as compared to a plot of mortality against region),
- a boxplot of income against region, and
- a scatter plot of mortality against income.
What trends do you see in these plots? Mention the trend separately for each plot.
(3+2 points)
B.2.2 Removing effect of predictor from response
Europe seems to have the lowest infant mortality, but it also has the highest per capita annual income. We want to see if Europe still has the lowest mortality if we remove the effect of income from the mortality. We will answer this question with the following steps.
B.2.2.1 Variable transformation
Plot:
- mortality against income,
- log(mortality) against income,
- mortality against log(income), and
- log(mortality) against log(income).
Based on the plots, postulate an appropriate model to predict mortality as a function of income. Print the model summary.
(2+4 points)
B.2.2.2 Model update
Update the model developed in the previous question by adding region as a predictor. Print the model summary.
(2 points)
Use the model developed in the previous question to compute adjusted_mortality for each observation in the data, where adjusted mortality is the mortality after removing the estimated effect of income. Make a boxplot of log(adjusted_mortality) against region.
(4+2 points)
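The idea of removing a predictor's estimated effect from a response can be sketched on synthetic data. The example below is one possible reading and rests on assumptions: it assumes a log-log model was chosen, it uses hypothetical columns x, y and group in place of income, mortality and region, and it subtracts the fitted slope times log(x) from log(y). A different model choice in the earlier question would change the adjustment.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y falls with x on the log-log scale, plus a group shift
rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=n),
    "x": rng.lognormal(mean=8.0, sigma=1.0, size=n),
})
shift = df["group"].map({"A": 0.0, "B": 0.3, "C": -0.2})
noise = rng.normal(scale=0.2, size=n)
df["y"] = np.exp(9.0 - 0.6 * np.log(df["x"]) + shift + noise)

# Work on the log scale (assumed model: log_y ~ log_x + group)
df["log_x"] = np.log(df["x"])
df["log_y"] = np.log(df["y"])
model = smf.ols("log_y ~ log_x + C(group)", data=df).fit()
slope = model.params["log_x"]

# Remove the estimated effect of x from y on the log scale
df["log_adjusted_y"] = df["log_y"] - slope * df["log_x"]
df["adjusted_y"] = np.exp(df["log_adjusted_y"])

print(slope)
print(df.groupby("group")["log_adjusted_y"].median())
```

A boxplot of log_adjusted_y by group (for example with pandas' DataFrame.boxplot using the column and by arguments) then compares the groups with the fitted x effect taken out.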
B.2.3 Data visualisation after removing effect of predictor from response
From the plot in the previous question:
Does Europe still seem to have the lowest mortality as compared to other regions after removing the effect of income from mortality?
After adjusting for income, is there any change in the mortality comparison among different regions? Compare the plot developed in the previous question to the plot of log(mortality) against region developed earlier (B.2.1) to answer this question.
Hint: Do any African / Asian / American countries seem to do better than all the European countries with regard to mortality after adjusting for income?
(1+3 points)
B.3 Variable transformations and interactions
The dataset soc_ind.csv contains the GDP per capita of some countries along with several social indicators.
B.3.1 Training SLR
For a simple linear regression model predicting gdpPerCapita, which predictor will provide the best model fit (ignore categorical predictors)? Let that predictor be \(P\).
(2 points)
B.3.2 Linearity in relationship
Make a scatterplot of gdpPerCapita vs \(P\). Does the relationship between gdpPerCapita and \(P\) seem linear or non-linear?
(1 + 2 points)
B.3.3 Variable transformation
If the relationship identified in the previous question is non-linear, identify and include transformation(s) of the predictor \(P\) in the model to improve the model fit.
Mention the predictors of the transformed model, and report the change in the \(R\)-squared value of the transformed model as compared to the simple linear regression model with only \(P\).
(4+4 points)
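A transformed predictor can be added directly in a statsmodels formula with I(). The sketch below is illustrative only: the data are synthetic, the names x and y are hypothetical, and the quadratic term is an assumption made for the example rather than a claim about which transformation suits soc_ind.csv.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a curved (quadratic) relationship between x and y
rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({"x": rng.uniform(0, 10, size=n)})
df["y"] = 3.0 + 0.5 * df["x"] ** 2 + rng.normal(scale=2.0, size=n)

# Straight-line model versus a model that also includes x squared
linear = smf.ols("y ~ x", data=df).fit()
quadratic = smf.ols("y ~ x + I(x ** 2)", data=df).fit()

# Change in R-squared from adding the transformed predictor
gain = quadratic.rsquared - linear.rsquared
print(linear.rsquared, quadratic.rsquared, gain)

# Fitted curve on a grid of x values, ready to overlay on a scatterplot
grid = pd.DataFrame({"x": np.linspace(df["x"].min(), df["x"].max(), 100)})
curve = quadratic.predict(grid)
```

Because the linear model is nested in the quadratic one, \(R\)-squared cannot decrease when the extra term is added; the size of the gain is what indicates whether the transformation helps.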
B.3.4 Model visualisation with transformed predictor
Plot the regression curve of the transformed model (developed in the previous question) over the scatterplot from B.3.2 to visualize the model fit. Also plot the regression line of the simple linear regression model with only \(P\) on the same plot.
(3 + 1 points)
B.3.5 Training MLR with qualitative predictor
Develop a model to predict gdpPerCapita with \(P\) and continent as predictors.
Interpret the intercept term.
For a given value of \(P\), are there any continents that do not have a significant difference between their mean gdpPerCapita and that of Africa? If yes, then which ones, and why? If no, then why not? Consider a significance level of 5%.
(4 + 4 points)
B.3.6 Variable interaction
The model developed in the previous question has a limitation. It assumes that the increase in mean gdpPerCapita with a unit increase in \(P\) does not depend on the continent.
Eliminate this limitation by including the interaction of continent with \(P\) in the model developed in the previous question. Print the model summary of the model with interactions. Interpret the coefficient of any one of the interaction terms.
(4 + 4 points)
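An interaction between a continuous and a categorical predictor lets each category have its own slope. The following is a minimal sketch on synthetic data with hypothetical names x, y and group (two levels, A and B); it shows how the slope of a non-reference level is the sum of the baseline slope and the interaction coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: the slope of y on x is 1.0 in group A and 3.0 in group B
rng = np.random.default_rng(6)
n = 300
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),
    "x": rng.uniform(0, 10, size=n),
})
true_slope = df["group"].map({"A": 1.0, "B": 3.0})
df["y"] = 2.0 + true_slope * df["x"] + rng.normal(size=n)

# "x * group" expands to x + group + x:group (main effects and interaction)
model = smf.ols("y ~ x * group", data=df).fit()

# Find the interaction term by its ":" rather than hard-coding its label
inter_name = [k for k in model.params.index if ":" in k][0]

slope_A = model.params["x"]                             # reference level A
slope_B = model.params["x"] + model.params[inter_name]  # A's slope + difference
print(inter_name, slope_A, slope_B)
```

The interaction coefficient is therefore read as the difference in slope between that level and the reference level, not as a slope in its own right.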
B.3.7 Model visualisation with qualitative predictor
Use the model developed in the previous question to plot the regression lines for Africa, Asia, and Europe. Put gdpPerCapita on the vertical axis and \(P\) on the horizontal axis. Use a legend to distinguish among the regression lines of the three continents.
(4 points)
B.3.8 Model interpretation
Based on the plot developed in the previous question, which continent has the highest increase in mean gdpPerCapita for a unit increase in \(P\), and which one has the least? Justify your answer.
(2+2 points)